{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Julia is fast\n",
    "\n",
    "Very often, benchmarks are used to compare languages.  These benchmarks can lead to long discussions, first as to exactly what is being benchmarked and secondly what explains the differences.  These simple questions can sometimes get more complicated than you at first might imagine.\n",
    "\n",
    "The purpose of this notebook is for you to see a simple benchmark for yourself.  One can read the notebook and see what happened on the author's ThinkPad with a 4-core Intel Core I7, or run the notebook yourself.\n",
    "\n",
    "(This material began life as a wonderful lecture by Steven Johnson at MIT: <https://github.com/stevengj/18S096/blob/master/lectures/lecture1/Boxes-and-registers.ipynb>.)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# `sum`: An easy enough function to understand"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Consider the  **sum** function `sum(a)`, which computes\n",
    "\n",
    "$$\n",
    "\\mathrm{sum}(a) = \\sum_{i=1}^n a_i.\n",
    "$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Seed the random number generator so that it will\n",
    "# produce the same set of random numbers. \n",
    "#\n",
    "# You don't generally need to do this, but it ensures\n",
    "# that you'll see exactly the same results we do for\n",
    "# this demo. \n",
    "srand(42);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "scrolled": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "10000000-element Array{Float64,1}:\n",
       " 0.533183 \n",
       " 0.454029 \n",
       " 0.0176868\n",
       " 0.172933 \n",
       " 0.958926 \n",
       " 0.973566 \n",
       " 0.30387  \n",
       " 0.176909 \n",
       " 0.956916 \n",
       " 0.584284 \n",
       " 0.937466 \n",
       " 0.160006 \n",
       " 0.422956 \n",
       " ⋮        \n",
       " 0.328823 \n",
       " 0.116738 \n",
       " 0.30473  \n",
       " 0.749187 \n",
       " 0.301049 \n",
       " 0.530005 \n",
       " 0.230942 \n",
       " 0.563132 \n",
       " 0.416549 \n",
       " 0.526003 \n",
       " 0.851597 \n",
       " 0.291042 "
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "a = rand(10^7) # array of random numbers, uniform on [0,1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.000811321070729e6"
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "sum(a) # one expects this is 10^7 * .5 , since the mean of each entry is .5"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Benchmarking a few implementations in different languages"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "\u001b[1m\u001b[36mINFO: \u001b[39m\u001b[22m\u001b[36mRecompiling stale cache file /home/rdeits/.julia/lib/v0.6/BenchmarkTools.ji for module BenchmarkTools.\n",
      "\u001b[39m"
     ]
    }
   ],
   "source": [
    "using BenchmarkTools: @btime, @benchmark\n",
    "\n",
    "# BenchmarkTools provides the @benchmark and @btime macros,\n",
    "# which run a given function multiple times to provide a \n",
    "# robust estimate of its true runtime. "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#  1. The C language: (8.0 msecs)\n",
    "\n",
    "C is often considered the gold standard: difficult on the human, nice for the machine. Getting within a factor of 2 of C is often satisfying. Nonetheless, even within C, there are many kinds of optimizations possible that a naive C writer may or may not get the advantage of.\n",
    "\n",
    "You can put C code in a Julia session, compile it, and run it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "c_sum (generic function with 1 method)"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "C_code = \"\"\"\n",
    "#include <stddef.h>\n",
    "double c_sum(size_t n, double *X) {\n",
    "    double s = 0.0;\n",
    "    size_t i;\n",
    "    for (i = 0; i < n; ++i) {\n",
    "        s += X[i];\n",
    "    }\n",
    "    return s;\n",
    "}\n",
    "\"\"\"\n",
    "\n",
    "const Clib = tempname()   # make a temporary file\n",
    "\n",
    "\n",
    "# compile to a shared library by piping C_code to gcc\n",
    "# (works only if you have gcc installed):\n",
    "\n",
    "open(`gcc -fPIC -O3 -msse3 -xc -shared -o $(Clib * \".\" * Libdl.dlext) -`, \"w\") do f\n",
    "    print(f, C_code) \n",
    "end\n",
    "\n",
    "# define a Julia function that calls the C function implemnted above:\n",
    "c_sum(X::Array{Float64}) = ccall((\"c_sum\", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_sum(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "c_sum(a) ≈ sum(a) # type \\approx and then <TAB> to get the ≈ symbol"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can now benchmark the C code directly from Julia:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  8.017 ms (0 allocations: 0 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime c_sum($a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 2. Python's built in `sum` (74 msecs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# Julia interface to Python:\n",
    "using PyCall"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PyObject <built-in function sum>"
      ]
     },
     "execution_count": 10,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# call a low-level PyCall function to get a Python list, because\n",
    "# by default PyCall will convert to a NumPy array instead (we benchmark NumPy below):\n",
    "\n",
    "apy_list = PyCall.array2py(a, 1, 1)\n",
    "\n",
    "# get the Python built-in \"sum\" function:\n",
    "pysum = pybuiltin(\"sum\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 11,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pysum(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "pysum(a) ≈ sum(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  74.272 ms (17 allocations: 512 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime $pysum($apy_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 3. Python: `numpy` (4.5 msec)  \n",
    "\n",
    "## Takes advantage of hardware \"SIMD\"\n",
    "\n",
    "`numpy` is an optimized C library, callable from Python"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If it is not installed, install it from Julia as follows:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "# using Conda \n",
    "# Conda.add(\"numpy\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.0008113210707335e6"
      ]
     },
     "execution_count": 15,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numpy_sum = pyimport(\"numpy\")[\"sum\"]\n",
    "apy_numpy = PyObject(a) # converts to a numpy array by default\n",
    "\n",
    "numpy_sum(apy_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "numpy_sum(apy_list) ≈ sum(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  4.579 ms (17 allocations: 512 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.0008113210707335e6"
      ]
     },
     "execution_count": 17,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime $numpy_sum($apy_numpy)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 4. Python, hand written (319 msec!)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "PyObject <function mysum at 0x7ff03bcf47d0>"
      ]
     },
     "execution_count": 18,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# The PyCall package lets us define python functions directly from Julia:\n",
    "\n",
    "py\"\"\"\n",
    "def mysum(a):\n",
    "    s = 0.0\n",
    "    for x in a:\n",
    "        s = s + x\n",
    "    return s\n",
    "\"\"\"\n",
    "\n",
    "# mysum_py is a reference to the Python mysum function\n",
    "mysum_py = py\"\"\"mysum\"\"\"o"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 19,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mysum_py(apy_list)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "true"
      ]
     },
     "execution_count": 20,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "mysum_py(apy_list) ≈ sum(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  319.106 ms (17 allocations: 512 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 21,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime $mysum_py($apy_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 5. Julia (built-in) (4.4 msec) \n",
    "\n",
    "## Written directly in Julia, not in C! \n",
    "\n",
    "(and just as fast as numpy's optimized `sum()`)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/html": [
       "sum(a) at <a href=\"https://github.com/JuliaLang/julia/tree/d386e40c17d43b79fc89d3e579fc04547241787c/base/reduce.jl#L359\" target=\"_blank\">reduce.jl:359</a>"
      ],
      "text/plain": [
       "sum(a) in Base at reduce.jl:359"
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@which sum(a)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  4.401 ms (0 allocations: 0 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.000811321070729e6"
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime sum($a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Note that, like NumPy, Julia's built-in `sum` uses [pairwise summation](https://en.wikipedia.org/wiki/Pairwise_summation) for improved numerical accuracy."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 6. Julia (hand-written) (8.0 msec, same as hand-written C)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "mysum (generic function with 1 method)"
      ]
     },
     "execution_count": 24,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function mysum(A)   \n",
    "    s = 0.0  # s = zero(eltype(A))\n",
    "    for a in A\n",
    "        s += a\n",
    "    end\n",
    "    s\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {
    "scrolled": true
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  8.018 ms (0 allocations: 0 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.000811321070464e6"
      ]
     },
     "execution_count": 25,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime mysum($a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 7. Julia (hand-written and using SIMD instructions) (4.2 msec)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "mysum_simd (generic function with 1 method)"
      ]
     },
     "execution_count": 26,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "function mysum_simd(A)   \n",
    "    s = 0.0  # s = zero(eltype(A))\n",
    "    @simd for a in A\n",
    "        s += a\n",
    "    end\n",
    "    s\n",
    "end"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  4.253 ms (0 allocations: 0 bytes)\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "5.000811321070714e6"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "@btime mysum_simd($a)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Conclusion\n",
    "\n",
    "It's often pretty straightforward to write Julia code which is as fast as well-written C code! Loops in Julia are fast, so feel free to use them whenever they help organize or clarify your code. For the innermost loops of performance-critical code, you can exploit `simd` instructions to get even better performance. "
   ]
  }
 ],
 "metadata": {
  "anaconda-cloud": {},
  "kernelspec": {
   "display_name": "Julia 0.6.2",
   "language": "julia",
   "name": "julia-0.6"
  },
  "language_info": {
   "file_extension": ".jl",
   "mimetype": "application/julia",
   "name": "julia",
   "version": "0.6.2"
  },
  "toc": {
   "nav_menu": {
    "height": "212px",
    "width": "252px"
   },
   "navigate_menu": true,
   "number_sections": true,
   "sideBar": true,
   "threshold": "2",
   "toc_cell": false,
   "toc_section_display": "block",
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
